37 research outputs found
Evaluating Semantic Parsing against a Simple Web-based Question Answering Model
Semantic parsing shines at analyzing complex natural language that involves
composition and computation over multiple pieces of evidence. However, datasets
for semantic parsing contain many factoid questions that can be answered from a
single web document. In this paper, we propose to evaluate semantic
parsing-based question answering models by comparing them to a question
answering baseline that queries the web and extracts the answer only from web
snippets, without access to the target knowledge-base. We investigate this
approach on COMPLEXQUESTIONS, a dataset designed to focus on compositional
language, and find that our model obtains reasonable performance (35 F1
compared to 41 F1 of state-of-the-art). We find in our analysis that our model
performs well on complex questions involving conjunctions, but struggles on
questions that involve relation composition and superlatives.Comment: *sem 201
In-Context Learning Creates Task Vectors
In-context learning (ICL) in Large Language Models (LLMs) has emerged as a
powerful new learning paradigm. However, its underlying mechanism is still not
well understood. In particular, it is challenging to map it to the "standard"
machine learning framework, where one uses a training set to find a
best-fitting function in some hypothesis class. Here we make progress on
this problem by showing that the functions learned by ICL often have a very
simple structure: they correspond to the transformer LLM whose only inputs are
the query and a single "task vector" calculated from the training set.
Thus, ICL can be seen as compressing into a single task vector
and then using this task vector to modulate the
transformer to produce the output. We support the above claim via comprehensive
experiments across a range of models and tasks.Comment: Accepted at Findings of EMNLP 202
Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space
Transformer-based language models (LMs) are at the core of modern NLP, but
their internal prediction construction process is opaque and largely not
understood. In this work, we make a substantial step towards unveiling this
underlying prediction process, by reverse-engineering the operation of the
feed-forward network (FFN) layers, one of the building blocks of transformer
models. We view the token representation as a changing distribution over the
vocabulary, and the output from each FFN layer as an additive update to that
distribution. Then, we analyze the FFN updates in the vocabulary space, showing
that each update can be decomposed to sub-updates corresponding to single FFN
parameter vectors, each promoting concepts that are often human-interpretable.
We then leverage these findings for controlling LM predictions, where we reduce
the toxicity of GPT2 by almost 50%, and for improving computation efficiency
with a simple early exit rule, saving 20% of computation on average
CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks
Recent efforts in natural language processing (NLP) commonsense reasoning
research have yielded a considerable number of new datasets and benchmarks.
However, most of these datasets formulate commonsense reasoning challenges in
artificial scenarios that are not reflective of the tasks which real-world NLP
systems are designed to solve. In this work, we present CRoW, a
manually-curated, multi-task benchmark that evaluates the ability of models to
apply commonsense reasoning in the context of six real-world NLP tasks. CRoW is
constructed using a multi-stage data collection pipeline that rewrites examples
from existing datasets using commonsense-violating perturbations. We use CRoW
to study how NLP systems perform across different dimensions of commonsense
knowledge, such as physical, temporal, and social reasoning. We find a
significant performance gap when NLP systems are evaluated on CRoW compared to
humans, showcasing that commonsense reasoning is far from being solved in
real-world task settings. We make our dataset and leaderboard available to the
research community at https://github.com/mismayil/crow.Comment: 37 pages, camera-ready for EMNLP 202